Star allele search: a pharmacogenetic annotation database and user-friendly search tool of publicly available 1000 Genomes Project biospecimens

Here we describe a new public pharmacogenetic (PGx) annotation database of a large (n = 3,202) and diverse biospecimen collection of 1000 Genomes Project cell lines and DNAs. The database is searchable with a user-friendly, web-based tool (www.coriell.org/StarAllele/Search). This resource leverages existing whole genome sequencing data and PharmVar annotations to characterize *alleles for each biospecimen in the collection. This new tool is designed to facilitate in vitro functional characterization of *allele haplotypes and diplotypes as well as support clinical PGx assay development, validation, and implementation.

Supplementary Information
The online version contains supplementary material available at 10.1186/s12864-024-09994-6.


Background
Pharmacogenomics (PGx) holds the potential to improve medication management by increasing efficacy and by reducing toxicity [1–7]. Translating pharmacogenomic research into clinical care, however, requires a robust interdisciplinary infrastructure [8,9]. Characterizing the full range of functionally relevant human pharmacogenetic variation is limited by the documented underrepresentation of many communities living in the United States and around the world [10–16], and this effort would benefit from a large and diverse collection of publicly available and well-characterized cell lines. Such a resource would facilitate a more comprehensive understanding of pharmacogene variation and in vitro drug response [17–23]. Moreover, a well-characterized and diverse set of publicly available and renewable DNA samples would benefit the clinical communities that require positive and negative controls for assay development, validation, implementation, and proficiency testing for robust PGx testing.
The Genetic Reference and Testing Materials Coordination Program (GeT-RM) has used a variety of clinical testing methods to characterize lymphoblastoid cell line (LCL) DNAs for 28 pharmacogenes [24], and more recently has incorporated next generation sequencing data for the characterization of CYP2D6 (n = 179) [25], as well as CYP2C8, CYP2C9 and CYP2C19 (n = 137) [26]. Here we describe a complementary PGx annotation resource that includes a significantly larger set (n = 3,202) of renewable and publicly available 1000 Genomes Project LCLs and DNAs available through the National Human Genome Research Institute (NHGRI) Sample Repository for Human Genetic Research (https://catalog.coriell.org/1/NHGRI) and the National Institute of General Medical Sciences (NIGMS) Human Genetic Cell Repository (https://catalog.coriell.org/1/NIGMS). This new annotation resource leverages 30x whole genome sequencing (WGS) data [27], is downloadable (Table S1), and may be searched with a user-friendly, web-based tool, Star Allele Search (www.coriell.org/StarAllele/Search).

We leveraged existing publicly available 30x coverage WGS data from 3,202 samples generated and phased by the New York Genome Center (NYGC) [27]. A detailed description of the data collection and analysis can be found in Byrska-Bishop et al. [27]. Briefly, 3,202 samples from the 1000 Genomes Project collection were selected for inclusion in the WGS data collection [27] (Table 1). The sample set includes 2,504 unrelated individuals as well as 698 relatives (that together complete 602 trios) [27], and the WGS data were collected with an Illumina NovaSeq 6000 System [27]. The raw WGS data were aligned to the GRCh38 reference genome, and variant calling was performed with GATK [27,28]. The WGS variant information was additionally phased into haplotypes; autosomal single nucleotide variants (SNVs) and insertions/deletions (INDELs) were statistically phased using SHAPEIT-duoHMM with pedigree-based correction [27,29,30].
A detailed description of the ursaPGx annotation approach can be found in [34]. Briefly, for each non-CYP2D6 pharmacogene, the star allele defining variants according to PharmVar are extracted from the phased VCF file, and the annotation is assigned when all star allele defining variants are present for a given VCF haplotype. In cases where no complete match between the phased haplotype and any PharmVar star allele occurs, the haplotype is annotated as ambiguous (Amb). The complete list of variants included in the phased VCF used for non-CYP2D6 star allele annotations can be searched at the following website, either by specific rsid (up to 100 rsids can be included in a single search) or by HUGO gene symbol (https://www.coriell.org/SNPSearch/WGS) [34].

The Cyrius annotation approach is described in detail by Chen et al. 2021 [35]; the details most relevant to Star Allele Search are as follows. Cyrius first infers the combined number of CYP2D6 and CYP2D7 copies from the WGS BAM files using the reads mapped to either gene and then uses 117 variants to further differentiate between CYP2D6 and CYP2D7 reads for gene-specific copy number inference [35]. The Cyrius output differentiates several classes of annotations (https://github.com/Illumina/Cyrius). For the purposes of Star Allele Search, and to be as consistent as possible with the ursaPGx annotation approach for non-CYP2D6 pharmacogenes, we retained only those annotations where Cyrius indicates a unique and non-ambiguous match to a given PharmVar *allele annotation in the sample JavaScript Object Notation (JSON) output file ("Filter" = "PASS", indicating a passing, confident call, and "Call_info" = "unique_match", indicating a specific match to the annotated PharmVar *allele). More detail about the Cyrius annotation for each sample included in the JSON output, including the specific variants used for *allele annotation for each sample, is included in Table S2.
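The retention rule for CYP2D6 calls can be illustrated with a short Python sketch. It is a minimal illustration and not the pipeline used to build the database: the "Filter" and "Call_info" field names and their retained values come from the description above, whereas the top-level layout (one record per sample identifier), the "Genotype" field name, the non-retained values, and the sample records themselves are assumptions invented for the example.

```python
import json

# Hypothetical Cyrius-style output: one record per sample identifier.
# The layout, the "Genotype" field name, and every value below are
# illustrative assumptions, not real calls from the database.
EXAMPLE_JSON = """
{
  "EXAMPLE_1": {"Genotype": "*1/*2", "Filter": "PASS", "Call_info": "unique_match"},
  "EXAMPLE_2": {"Genotype": "*1/*4", "Filter": "PASS", "Call_info": "some_other_class"},
  "EXAMPLE_3": {"Genotype": "*2/*4", "Filter": "NOT_PASS", "Call_info": "unique_match"},
  "EXAMPLE_4": {"Genotype": null, "Filter": null, "Call_info": null}
}
"""


def retain_confident_calls(records):
    """Keep only calls flagged as PASS and as a unique star allele match."""
    return {
        sample: record["Genotype"]
        for sample, record in records.items()
        if record.get("Filter") == "PASS"
        and record.get("Call_info") == "unique_match"
    }


records = json.loads(EXAMPLE_JSON)
retained = retain_confident_calls(records)
print(retained)
```

Only the first hypothetical record satisfies both conditions, so it is the only one that would be carried forward under this rule.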

Utility and discussion
Here we describe a new public PGx annotation database with a user-friendly, web-based search tool of associated lymphoblastoid cell line and DNA biospecimens. This new resource complements existing databases generated by GeT-RM; while GeT-RM works directly with clinical laboratories to develop robust PGx annotated biospecimens designed to serve as reference materials for genetic testing, this effort is extremely involved and not easily scalable to larger collections of biospecimens.
This new annotation database therefore offers a slightly less robust characterization of a significantly larger collection of diverse biospecimens (Table 1) to support PGx related research efforts and to serve as a starting point for clinical testing communities to identify potentially relevant reference materials for their testing needs. More specifically, Star Allele Search uses a single WGS dataset [27] as well as a single annotation approach. These choices maximize consistency and transparency across all of the biospecimen annotations in the database and are thereby well suited for a large database of thousands of samples. Any researcher interested in using Star Allele Search annotations can view the specific variants included in each *allele annotation (for non-CYP2D6 pharmacogenes at https://www.coriell.org/SNPSearch/WGS, and for CYP2D6 in Table S2), and can view each specific *allele annotation definition at PharmVar (https://www.pharmvar.org/genes). Moreover, as PharmVar releases new versions of their annotations, we are well positioned to periodically update a corresponding version of Star Allele Search shortly thereafter. However, the relatively large size of the biospecimen set is not well suited to the more robust GeT-RM approach that leverages sequencing data collected from multiple laboratories together with multiple annotation analysis pipelines, and then constructs a consensus *allele annotation for each sample for each included pharmacogene [25,26].
To assess the quality and accuracy of the PGx annotation database, we compared overlapping samples that were already characterized by GeT-RM using next-generation sequencing data, which are available for CYP2C8, CYP2C9, CYP2C19, and CYP2D6 [25,26]. In total, we identified 87 overlapping samples between GeT-RM and the current annotation dataset [25,26]. We found 100%, 99%, and 97% concordance between our annotation and the GeT-RM NGS consensus annotation for CYP2C8, CYP2C19, and CYP2C9, respectively [26], and 94% concordance between our annotation and the GeT-RM NGS consensus annotation for CYP2D6 [25].
Our CYP2C19 comparison identified a discrepancy for a single sample (NA19122). The GeT-RM NGS consensus is *2/*35 [26], while our annotation is *2|*Amb. As described above, our approach requires a complete match between a given phased haplotype and all of the PharmVar defining variants for a given star allele. For NA19122, the first haplotype included all of the variants required to annotate *2 (non-reference alleles for rs12769205, rs4244285, and rs3758581), consistent with GeT-RM [26]; however, the second haplotype in our phased VCF file includes both variants required to annotate *35 (non-reference alleles for rs12769205 and rs3758581) as well as a non-reference allele at rs17882687, which in our approach precludes an unambiguous call of either *35 or *15.
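The NA19122 case can be worked through with a small Python sketch of strict matching. It is a toy model under stated assumptions, not the ursaPGx implementation: it reads "complete match" as exact equality between the set of non-reference variants on a haplotype and the defining set of a single star allele, it includes only the two CYP2C19 definitions quoted in the text above rather than the full PharmVar table, and it does not model the reference allele. The function names are hypothetical.

```python
# Defining variants for two CYP2C19 star alleles, as listed in the text above.
# Toy table only: the full PharmVar definition set is much larger.
TOY_DEFINITIONS = {
    "*2": frozenset({"rs12769205", "rs4244285", "rs3758581"}),
    "*35": frozenset({"rs12769205", "rs3758581"}),
}


def call_haplotype(non_ref_rsids, definitions=TOY_DEFINITIONS):
    """Return the star allele whose defining set equals the observed set.

    Assumption: a call requires exact set equality; anything else is "*Amb".
    """
    observed = frozenset(non_ref_rsids)
    matches = [allele for allele, defining in definitions.items()
               if defining == observed]
    return matches[0] if len(matches) == 1 else "*Amb"


def call_diplotype(hap1, hap2):
    """Join the two haplotype calls with the phased separator used in the text."""
    return f"{call_haplotype(hap1)}|{call_haplotype(hap2)}"


# Non-reference variants per phased haplotype for NA19122, as described above.
na19122_hap1 = {"rs12769205", "rs4244285", "rs3758581"}
na19122_hap2 = {"rs12769205", "rs3758581", "rs17882687"}

print(call_diplotype(na19122_hap1, na19122_hap2))  # *2|*Amb
```

Under this reading, the extra non-reference allele at rs17882687 is what prevents the second haplotype from equalling the *35 definition, which reproduces the *2|*Amb annotation reported for this sample.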
We identified discordant CYP2C9 star allele annotations for three samples. Our approach annotated two samples (NA19143 and NA19213) as *1|*1, while the GeT-RM NGS consensus is *1/*6 [26]. This discrepancy is due to a limitation of the phased WGS VCF file we used, which does not contain rs9332131, the single base deletion that defines *6. We annotated the third discordant sample (HG01190) as *61|*1, whereas the GeT-RM NGS consensus is *2/*61 [26]. We believe this difference is due to differences in variant calling and phasing approaches. In the phased VCF we used, this sample is heterozygous for both variants required to annotate *61 (rs1799853 and rs202201137), and both of these variants occur on the first haplotype of the sample. We also note that while the consensus annotation is *2/*61, a minority of the groups participating in the study annotated this sample as *1/*61 [26].
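The comparison logic behind these concordance figures can be sketched as follows. This is a minimal illustration rather than the analysis code used for the study: the function names are hypothetical, the "EXAMPLE_1" sample is invented to show a concordant pair, and the percentage calculation assumes that all 87 overlapping samples form the denominator for each gene, which the text does not state explicitly.

```python
import re


def normalize(diplotype):
    """Return an order- and separator-independent form of a diplotype string."""
    return tuple(sorted(re.split(r"[|/]", diplotype)))


def concordance(ours, reference):
    """Count samples present in both sets whose diplotypes agree."""
    shared = sorted(set(ours) & set(reference))
    agree = sum(normalize(ours[s]) == normalize(reference[s]) for s in shared)
    return agree, len(shared)


# CYP2C9 calls for the three discordant samples described above, plus one
# invented concordant sample ("EXAMPLE_1") for contrast.
ours = {"NA19143": "*1|*1", "NA19213": "*1|*1",
        "HG01190": "*61|*1", "EXAMPLE_1": "*2|*1"}
getrm = {"NA19143": "*1/*6", "NA19213": "*1/*6",
         "HG01190": "*2/*61", "EXAMPLE_1": "*1/*2"}

agree, total = concordance(ours, getrm)
print(agree, total)  # 1 4

# Assumption: 87 overlapping samples per gene; discordant counts from the text.
percentages = {gene: round(100 * (87 - discordant) / 87)
               for gene, discordant in {"CYP2C19": 1, "CYP2C9": 3}.items()}
print(percentages)  # {'CYP2C19': 99, 'CYP2C9': 97}
```

Normalizing before comparing means that a phased call such as *2|*1 and an unphased consensus such as *1/*2 count as concordant, so only genuine allele differences are flagged.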
In total, Star Allele Search includes 663 diplotypes across 13 pharmacogenes (Table 2, Table S1), excluding diplotypes with one or two ambiguous (i.e., Amb) allele calls. Each unique diplotype and associated diplotype frequency in the database is detailed in Table S4, and each unique *allele haplotype and associated haplotype allele frequency in the database is detailed in Table S5. To determine the contribution of the larger sample set included in the database, we identified 3, 3, 7, and 10 new *alleles, respectively, in the dataset relative to GeT-RM [25,26] (Table S6). We performed a similar comparison for unique pairs of *alleles (diplotype combinations). We chose to conservatively exclude ambiguous calls, copy number variants, and complex CYP2D6 structural variants, and identified 12, 17, 23, and 129 new diplotypes, respectively, for CYP2C8, CYP2C19, CYP2C9, and CYP2D6 (Fig. 1, Table S6).

This new star allele annotated biospecimen database is of use for a wide range of applications. For example, researchers interested in functionally characterizing *alleles of interest can use the resource to choose LCLs with the most relevant diplotype combinations; researchers interested in developing new PGx assays can use the resource to benchmark performance; and clinical laboratories can use the resource to minimize the number of positive and negative control DNAs needed for a given PGx test.
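The diplotype tallies described above, which exclude ambiguous calls, can be sketched in Python as follows. This is an illustration under assumptions rather than the procedure used to build Tables S4 and S5: the calls are invented, diplotypes are treated as unordered pairs (so *1|*2 and *2|*1 are counted together), and frequencies are computed over unambiguous calls only. None of these choices is confirmed by the text.

```python
from collections import Counter


def tally_diplotypes(calls):
    """Count diplotypes, skipping any call with one or two "*Amb" alleles."""
    counts = Counter()
    for diplotype in calls:
        alleles = diplotype.split("|")
        if "*Amb" in alleles:
            continue  # exclude ambiguous calls, as in the text
        # Assumption: haplotype order is ignored when counting diplotypes.
        counts["|".join(sorted(alleles))] += 1
    return counts


# Invented calls for a single gene, for illustration only.
calls = ["*1|*2", "*2|*1", "*2|*2", "*2|*Amb", "*Amb|*Amb", "*1|*1"]

counts = tally_diplotypes(calls)
total = sum(counts.values())
frequencies = {diplotype: n / total for diplotype, n in counts.items()}

print(len(counts), "unique unambiguous diplotypes")
print(frequencies)
```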
We have additionally developed Star Allele Search (Fig. 2), which is a web-based search tool of the new PGx biospecimen annotation database to facilitate these types of research and clinical applications. In addition to this new database and search tool, users can choose to search the WGS data one variant at a time, up to one hundred variants at a time, or by gene (https://www.coriell.org/SNPSearch/WGS; [27]). Users can also search gene expression data collected from a subset of the *allele annotated LCLs (n = 462) (http://omicdata.coriell.org/geuv-expression-browser/; [37]).
All of these genomic data search tools are designed to complement each other to ensure researchers have a simple way to search a large collection of biospecimen genetic, genomic, and transcriptomic profiles with a web-based interface that does not require bioinformatic skill or experience. For example, a researcher interested in developing a CYP2C19 assay could first view, sort, filter, and/or download a comma-separated value (CSV) file of all of the CYP2C19 variants included in the WGS dataset with a single HUGO symbol search (https://www.coriell.org/SNPSearch/WGS) to confirm that the variants of interest are present in the data. The researcher could then view, sort, filter, and/or download a CSV of the annotated CYP2C19 *alleles for the entire sample set with Star Allele Search (www.coriell.org/StarAllele/Search) to identify the biospecimens with the relevant diplotypes. Finally, if an alternate annotation scheme is needed (i.e., not PharmVar), the researcher can view, sort, filter, and/or download a CSV of up to 100 individual CYP2C19 variants at a time to investigate any alternate combination of variants needed for that scheme (https://www.coriell.org/SNPSearch/WGS).
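Once a CSV has been exported from Star Allele Search, selecting biospecimens with a diplotype of interest takes only a few lines of Python. The sketch below is hypothetical: the column names ("Sample", "Gene", "Diplotype") and the rows are assumptions made for the example, and the headers of a real export should be checked before reusing it.

```python
import csv
import io

# Hypothetical export. The column names and rows are assumptions and may not
# match the headers of a real Star Allele Search download.
EXAMPLE_CSV = """Sample,Gene,Diplotype
EXAMPLE_1,CYP2C19,*2|*2
EXAMPLE_2,CYP2C19,*1|*2
EXAMPLE_3,CYP2C19,*2|*2
"""


def samples_with_diplotype(handle, diplotype,
                           sample_col="Sample", diplotype_col="Diplotype"):
    """Return sample identifiers whose diplotype column equals `diplotype`."""
    reader = csv.DictReader(handle)
    return [row[sample_col] for row in reader
            if row[diplotype_col] == diplotype]


hits = samples_with_diplotype(io.StringIO(EXAMPLE_CSV), "*2|*2")
print(hits)  # ['EXAMPLE_1', 'EXAMPLE_3']
```

For a downloaded file, the `io.StringIO` object would be replaced with an open file handle, for example `open(path, newline="")`.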
It is important to note the limitations of our approach and annotations. Our database annotations are based on short read (150 base pair, paired-end reads), 30x coverage WGS, and computational phasing [27]. Any error in variant calling, any missing single nucleotide or larger structural variation, and any error in phasing in the input VCF will propagate into annotation errors for the non-CYP2D6 pharmacogenes included in Star Allele Search. In addition, any error or missing single nucleotide or larger structural variation in the BAM files analyzed with Cyrius will similarly produce errors in the CYP2D6 annotations included in Star Allele Search. While this is the most robust, large-scale WGS dataset available for this sample set at present, we anticipate that as long-read sequencing becomes more affordable and more accessible, phase uncertainty (particularly for rare variants) will decrease significantly and structural variation resolution will improve significantly.

We also employed PharmVar annotation for our database and chose a strict matching requirement for each *allele annotation. This choice resulted in several ambiguous biospecimen calls in cases where one or both phased haplotypes were not an exact match to any PharmVar defined *allele. The number of pharmacogenes annotated in our database is limited by the number of genes annotated by PharmVar. Currently PharmVar includes thirteen pharmacogenes. Although the number of genes is limited, the clinical impact of these pharmacogenes is significant, with CYP2C9, CYP2C19, CYP2D6, CYP3A4, CYP3A5, CYP2A6, CYP2B6, and CYP2C8 alone metabolizing the vast majority of drugs in clinical use (e.g., [38]). Our automated approach, however, facilitates version updates to Star Allele Search as PharmVar releases new annotation versions with additional pharmacogenes.

Fig. 2 Star Allele database search results example for CYP2C19. Figure 2 displays a screenshot of the web-based Star Allele Search. This example displays results for CYP2C19, chosen from the dropdown search on the top, left-hand side of the page. The user may choose to view the list of PharmVar annotated pharmacogenes, the NCBI entry for the selected gene, the associated Gene Search page (which will display all of the variants included in the 30x WGS dataset for the selected gene), or to return to the general search page. The user may choose to export the Star Allele search results to a CSV file by clicking the green button on the right-hand side of the page. The user may additionally choose to filter by a given Star Allele diplotype, and this filter dropdown also displays the number of samples with each corresponding diplotype. Figure 2 displays results after filtering for *2|*2 diplotypes in the database.


Table 1 includes a summary of the publicly available 1000 Genomes Project biospecimens included in the star allele annotation database. The majority of the samples (n = 3,023) are available through the NHGRI Sample Repository for Human Genetic Research (https://catalog.coriell.org/1/NHGRI), and the collection of Utah Residents (Centre d'Etude du Polymorphisme Humain (CEPH)) with Northern and Western European Ancestry biospecimens (n = 179) is available through the NIGMS Human Genetic Cell Repository (https://catalog.coriell.org/1/NIGMS). Table S1 includes each individual NHGRI Sample Repository for Human Genetic Research and NIGMS Human Genetic Cell Repository identifier for all of the 1000 Genomes Project biospecimens annotated in the star allele annotation database.

Table 1
List of included populations
a Samples available through the NIGMS Human Genetic Cell Repository

Table 2
List of PharmVar annotated pharmacogenes and the number of diplotypes and *alleles included in the database
a Excluding star alleles with structural and copy number variation